I have created an appropriate analysis dataset and excluded unwanted observations. Below is an analysis of the dataset and its variables. The reader is assumed to have an introductory-course-level understanding of reading and writing R code and of statistical methods, but no previous knowledge of the data or of the more advanced methods used.
For these exercises the libraries “dplyr”, “ggplot2” and “GGally” are necessary; they need to be installed and loaded before running the code. I have also set the knitr chunk option cache = F, since without it my version of R refused to knit the plots I made.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
library("GGally")
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library("lattice")
Students2014 <- read.table("~/Documents/IODS-project/data/learning2014/learning2014.txt", header = TRUE, sep = " ")
A glimpse of the first rows of the data (printing all 166 rows is unnecessary):
head(Students2014)
##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21
#2.
dim(Students2014)
## [1] 166 7
str(Students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Next, a graphical overview of the data:
plot(Students2014)
The same pairs plot without the categorical variable ‘gender’:
pairs(Students2014[-1])
Last, a ggpairs graphic, coloured by gender, for a clearer display:
ggpairs(Students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
Next, scatterplots of each variable against the ‘Points’ variable to clarify their influence on it:
library(ggplot2)
qplot(Attitude, Points, data = Students2014) + geom_smooth(method = "lm")
qplot(Age, Points, data = Students2014) + geom_smooth(method = "lm")
qplot(gender, Points, data = Students2014, geom = "boxplot") # gender is a factor, so a boxplot is more informative than a linear smooth
qplot(deep, Points, data = Students2014) + geom_smooth(method = "lm")
qplot(stra, Points, data = Students2014) + geom_smooth(method = "lm")
qplot(surf, Points, data = Students2014) + geom_smooth(method = "lm")
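The six qplot calls above can also be condensed into one faceted figure. A sketch, assuming the tidyr package is available (it is not loaded in this document) and using simulated stand-in data, since the real file is read from a local path:

```r
library(ggplot2)
library(tidyr)

# Simulated stand-in mimicking the shape of Students2014
set.seed(1)
Students2014 <- data.frame(
  gender   = factor(sample(c("F", "M"), 166, replace = TRUE)),
  Age      = sample(17:55, 166, replace = TRUE),
  Attitude = sample(14:50, 166, replace = TRUE),
  deep     = runif(166, 1.5, 5),
  stra     = runif(166, 1.2, 5),
  surf     = runif(166, 1.5, 4.3),
  Points   = sample(7:33, 166, replace = TRUE)
)

# Gather the explanatory variables into long format, then facet:
# one panel per variable, each with its own regression line
long <- pivot_longer(Students2014, cols = c(Age, Attitude, deep, stra, surf),
                     names_to = "variable", values_to = "value")
ggplot(long, aes(value, Points)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ variable, scales = "free_x")
```

This keeps all five numeric predictors on a common Points axis, which makes their slopes easier to compare than in separate plots.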
Next, simple linear regressions of ‘Points’ on each of the other variables, with summaries:
M_Attitude <- lm(Points ~ Attitude, data = Students2014)
summary(M_Attitude)
##
## Call:
## lm(formula = Points ~ Attitude, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
M_Age <- lm(Points ~ Age, data = Students2014)
summary(M_Age)
##
## Call:
## lm(formula = Points ~ Age, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0360 -3.7531 0.0958 4.6762 10.8128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.52150 1.57339 15.585 <2e-16 ***
## Age -0.07074 0.05901 -1.199 0.232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008684, Adjusted R-squared: 0.00264
## F-statistic: 1.437 on 1 and 164 DF, p-value: 0.2324
M_gender <- lm(Points ~ gender, data = Students2014)
summary(M_gender)
##
## Call:
## lm(formula = Points ~ gender, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3273 -3.3273 0.5179 4.5179 10.6727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.3273 0.5613 39.776 <2e-16 ***
## genderM 1.1549 0.9664 1.195 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008632, Adjusted R-squared: 0.002587
## F-statistic: 1.428 on 1 and 164 DF, p-value: 0.2338
M_deep <- lm(Points ~ deep, data = Students2014)
summary(M_deep)
##
## Call:
## lm(formula = Points ~ deep, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6913 -3.6935 0.2862 4.9957 10.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1141 3.0908 7.478 4.31e-12 ***
## deep -0.1080 0.8306 -0.130 0.897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared: 0.000103, Adjusted R-squared: -0.005994
## F-statistic: 0.01689 on 1 and 164 DF, p-value: 0.8967
M_stra <- lm(Points ~ stra, data = Students2014)
summary(M_stra)
##
## Call:
## lm(formula = Points ~ stra, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
M_surf <- lm(Points ~ surf, data = Students2014)
summary(M_surf)
##
## Call:
## lm(formula = Points ~ surf, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6539 -3.3744 0.3574 4.4734 10.2234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.2017 2.4432 11.134 <2e-16 ***
## surf -1.6091 0.8613 -1.868 0.0635 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.01487
## F-statistic: 3.49 on 1 and 164 DF, p-value: 0.06351
Looking at the output of the summaries, only Attitude is a clearly significant explanatory variable (p ≈ 4e-09); stra and surf are borderline (p ≈ 0.06), while Age, gender and deep show no significant association with Points. The scatterplots of Points against the individual variables support this picture. A low p-value here means a low probability of observing an association this strong in a sample if the variable in fact had no effect on Points.
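The variable choices discussed above can also be checked directly from a correlation matrix. Below is a sketch with simulated stand-in data (the real file is read from a local path); on the real data, Attitude has the strongest correlation with Points, about 0.44, i.e. the square root of the R-squared reported in its regression summary above:

```r
# Sketch: correlation of each numeric variable with Points.
# The data frame here is simulated and only mimics the shape of Students2014.
set.seed(1)
Students2014 <- data.frame(
  Age      = sample(17:55, 166, replace = TRUE),
  Attitude = sample(14:50, 166, replace = TRUE),
  deep     = runif(166, 1.5, 5),
  stra     = runif(166, 1.2, 5),
  surf     = runif(166, 1.5, 4.3),
  Points   = sample(7:33, 166, replace = TRUE)
)

# Extract the row of correlations with Points and sort it
correlations <- cor(Students2014)["Points", ]
round(sort(correlations, decreasing = TRUE), 3)
```

With the real data this one call summarises what the six separate regressions above show about the strength of each association.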
The mean, standard deviation and variance of the dataset’s numeric variables (the factor ‘gender’ is excluded, since these statistics are not defined for it):
sapply(Students2014[-1], mean)
##       Age  Attitude      deep      stra      surf    Points
## 25.512048 31.427711  3.679719  3.121235  2.787149 22.716867
sapply(Students2014[-1], sd)
##       Age  Attitude      deep      stra      surf    Points
## 7.7660785 7.2990794 0.5541369 0.7718318 0.5288405 5.8948836
sapply(Students2014[-1], var)
##        Age   Attitude       deep       stra       surf     Points
## 60.3119752 53.2765608  0.3070677  0.5957244  0.2796722 34.7496532
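Since dplyr is loaded at the top of this document, group-wise summaries are also straightforward. A sketch with simulated stand-in data, comparing Points by gender in the spirit of the M_gender model below:

```r
library(dplyr)

# Simulated stand-in for Students2014 (the real file lives on a local path)
set.seed(1)
Students2014 <- data.frame(
  gender = factor(sample(c("F", "M"), 166, replace = TRUE)),
  Points = sample(7:33, 166, replace = TRUE)
)

# Count, mean and standard deviation of Points per gender
res <- Students2014 %>%
  group_by(gender) %>%
  summarise(n = n(), mean_points = mean(Points), sd_points = sd(Points))
res
```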
Below is a regression model with exam points as the target (dependent) variable and three explanatory variables. These were chosen because the plots and simple regressions above suggest they are the most plausible predictors of Points. The model is also plotted in several different ways to help interpret and understand its relevance.
Model3 <- lm(formula = Points ~ Attitude + Age + stra, data = Students2014)
plot(Model3)
The following calls work in my R project, but for some reason they refused to knit to HTML even after troubleshooting, so they are shown here unevaluated. I used them to make some of the plots discussed later in this exercise; those plots were not individually required for chapter 2 and are also produced in another way later in the code. If you want to experiment with these calls, copy them into an R script along with the other necessary elements. (Note that r.squared() is not a base R function, so the package providing it must be loaded first.)

r.squared(Model3, model = NULL, type = c("Attitude", "Age", "stra"), dfcor = TRUE)  # Normal Q-Q
r.squared(Model3, model = NULL, type = c("Attitude", "Age", "stra"), dfcor = FALSE)
r.squared(Model3, model = "lm", type = c("Attitude", "Age", "stra"), dfcor = TRUE)  # Res vs Lev
r.squared(Model3, model = "lm", type = c("Attitude", "Age", "stra"), dfcor = FALSE) # Res vs Fit
summary(Model3)
##
## Call:
## lm(formula = Points ~ Attitude + Age + stra, data = Students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## Attitude 0.34808 0.05622 6.191 4.72e-09 ***
## Age -0.08822 0.05302 -1.664 0.0981 .
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
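The coefficients reported above can be read as a prediction formula: Points ≈ 10.90 + 0.348 × Attitude − 0.088 × Age + 1.004 × stra. Below is a sketch computing a fitted value by hand for a hypothetical student profile (Attitude = 35, Age = 25, stra = 3 are made-up values, chosen only for illustration):

```r
# Coefficients copied from the summary(Model3) output above
b <- c(intercept = 10.89543, Attitude = 0.34808, Age = -0.08822, stra = 1.00371)

# Fitted value for a hypothetical student: Attitude 35, Age 25, stra 3
predicted <- b["intercept"] + b["Attitude"] * 35 + b["Age"] * 25 + b["stra"] * 3
round(unname(predicted), 1)  # about 23.9 points
```

With the real model object this is what predict(Model3, newdata = ...) would compute.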
The summary shows a clearly significant coefficient for Attitude, while Age and stra are only borderline (p ≈ 0.10 and 0.06). The residual standard error has remained quite low in relation to the number of variables used. R-squared measures how much of the variation in the target variable the model explains: here the Multiple R-squared is 0.2182, so the model explains about 22% of the variation in Points, which is modest but does not invalidate the regression. The variants of the r.squared calls above produced the Residuals vs Fitted, Residuals vs Leverage and Normal Q-Q plots, as well as the Scale-Location plot.
par(mfrow = c(2,2))
plot(Model3, which = c(1,2,5))
Linear regression models rest on a few general assumptions: 1. The relationship between the target and the explanatory variables is linear. 2. The errors of the model are normally distributed. 3. The errors are not correlated. 4. The sizes of the errors do not depend on the variables used to explain the target variable (constant variance).
Now let’s consider how the plots produced correspond (or not) to these assumptions:
The Q-Q plot compares the standardised residuals of the model to a theoretical normal distribution; since the points fall roughly on the reference line, the normality assumption seems reasonable for this model.
The Residuals vs Fitted plot shows no obvious pattern, which suggests that the errors are uncorrelated and that their spread does not depend on the fitted values.
Therefore all of the assumptions appear reasonably valid for the model created.
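The normality assumption can also be checked formally, for example with a Shapiro-Wilk test on the residuals. Below is a sketch on simulated data, since the real dataset is not bundled here; on the real model the corresponding call would be shapiro.test(residuals(Model3)):

```r
# Sketch: formal normality check of regression residuals on simulated data
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)  # linear model with normal errors

m <- lm(y ~ x, data = d)
shapiro.test(residuals(m))  # a large p-value is consistent with normality
```

This complements the Q-Q plot: the plot shows where departures from normality occur, while the test gives a single p-value for the overall hypothesis.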